NSF PAR Search | NSF Public Access Repository

Splitwise: Efficient Generative LLM Inference Using Phase Splitting

https://doi.org/10.1109/ISCA59077.2024.00019

Patel, Pratyush; Choukse, Esha; Zhang, Chaojie; Shah, Aashaka; Goiri, Íñigo; Maleki, Saeed; Bianchini, Ricardo (June 2024, IEEE)

Full Text Available
Coach: Exploiting Temporal Patterns for All-Resource Oversubscription in Cloud Platforms

https://doi.org/10.1145/3669940.3707226

Reidys, Benjamin; Zardoshti, Pantea; Goiri, Íñigo; Irvene, Celine; Berger, Daniel S; Ma, Haoran; Arya, Kapil; Cortez, Eli; Stark, Taylor; Bak, Eugene; et al (February 2025, ACM)

Free, publicly-accessible full text available February 3, 2026
Characterizing Power Management Opportunities for LLMs in the Cloud

https://doi.org/10.1145/3620666.3651329

Patel, Pratyush; Choukse, Esha; Zhang, Chaojie; Goiri, Íñigo; Warrier, Brijesh; Mahalingam, Nithish; Bianchini, Ricardo (April 2024, Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3)

Recent innovation in large language models (LLMs) and their myriad use cases have rapidly driven up the compute demand for datacenter GPUs. Several cloud providers and other enterprises plan to substantially grow their datacenter capacity to support these new workloads. A key bottleneck resource in datacenters is power, which LLMs are quickly saturating due to their rapidly increasing model sizes. We extensively characterize the power consumption patterns of a variety of LLMs and their configurations. We identify the differences between the training and inference power consumption patterns. Based on our analysis, we claim that the average and peak power utilization in LLM inference clusters should not be very high. Our deductions align with data from production LLM clusters, revealing that inference workloads offer substantial headroom for power oversubscription. However, the stringent set of telemetry and controls that GPUs offer in a virtualized environment makes it challenging to build a reliable and robust power management framework. We leverage the insights from our characterization to identify opportunities for better power management. As a detailed use case, we propose a new framework called POLCA, which enables power oversubscription in LLM inference clouds. POLCA is robust, reliable, and readily deployable. Using open-source models to replicate the power patterns observed in production, we simulate POLCA and demonstrate that we can deploy 30% more servers in existing clusters with minimal performance loss.
Full Text Available
Designing Cloud Servers for Lower Carbon

https://doi.org/10.1109/ISCA59077.2024.00041

Wang, Jaylen; Berger, Daniel S; Kazhamiaka, Fiodar; Irvene, Celine; Zhang, Chaojie; Choukse, Esha; Frost, Kali; Fonseca, Rodrigo; Warrier, Brijesh; Bansal, Chetan; et al (June 2024, IEEE)

To mitigate climate change, we must reduce carbon emissions from hyperscale cloud computing. We find that cloud compute servers cause the majority of emissions in a general-purpose cloud. Thus, we motivate designing carbon-efficient compute server SKUs, or GreenSKUs, using recently available low-carbon server components. To this end, we design and build three GreenSKUs using low-carbon components, such as energy-efficient CPUs, reused old DRAM via CXL, and reused old SSDs. We detail several challenges that limit GreenSKUs’ carbon savings at scale and may prevent their adoption by cloud providers. To address these challenges, we develop a novel methodology and associated framework, GSF (GreenSKU Framework), that enables a cloud provider to systematically evaluate a GreenSKU’s carbon savings at scale. We implement GSF within Microsoft Azure’s production constraints to evaluate our three GreenSKUs’ carbon savings. Using GSF, we show that our most carbon-efficient GreenSKU reduces emissions per core by 28% compared to currently deployed cloud servers. When designing GreenSKUs to meet applications’ performance requirements, we reduce emissions by 15%. When incorporating overall data center overheads, our GreenSKU reduces Azure’s net cloud emissions by 8%.
Full Text Available
Dense Server Design for Immersion Cooling

https://doi.org/10.1145/3687965

Kodnongbua, Milin; Englhardt, Zachary; Bianchini, Ricardo; Fonseca, Rodrigo; Lebeck, Alvin; Berger, Daniel S; Iyer, Vikram; Kazhamiaka, Fiodar; Schulz, Adriana (November 2024, ACM Transactions on Graphics)

The growing demands for computational power in cloud computing have led to a significant increase in the deployment of high-performance servers. The growing power consumption of servers and the heat they produce are on track to outpace the capacity of conventional air cooling systems, necessitating more efficient cooling solutions such as liquid immersion cooling. The superior heat exchange capabilities of immersion cooling eliminate the need for bulky heat sinks, fans, and air flow channels, and also unlock the potential to go beyond conventional 2D blade servers to three-dimensional designs. In this work, we present a computational framework to explore designs of servers in three-dimensional space, specifically targeting the maximization of server density within immersion cooling tanks. Our tool is designed to handle a variety of physical and electrical server design constraints. We demonstrate that our optimized designs can reduce server volume by 25–52% compared to traditional flat server designs. This increased density reduces land usage as well as the amount of liquid used for immersion, with a significant reduction in the carbon emissions embodied in datacenter buildings. We further create physical prototypes to simulate dense server designs and perform real-world experiments in an immersion cooling tank, demonstrating that they operate at safe temperatures. This approach marks a critical step forward in sustainable and efficient datacenter management.
Faster and Cheaper Serverless Computing on Harvested Resources

https://doi.org/10.1145/3477132.3483580

Zhang, Yanqi; Goiri, Íñigo; Chaudhry, Gohar Irfan; Fonseca, Rodrigo; Elnikety, Sameh; Delimitrou, Christina; Bianchini, Ricardo (October 2021, SOSP '21: Proceedings of the ACM SIGOPS 28th Symposium on Operating Systems Principles)

Full Text Available
Scouts: Improving the Diagnosis Process Through Domain-customized Incident Routing

https://doi.org/10.1145/3387514.3405867

Gao, Jiaqi; Yaseen, Nofel; MacDavid, Robert; Frujeri, Felipe Vieira; Liu, Vincent; Bianchini, Ricardo; Aditya, Ramaswamy; Wang, Xiaohang; Lee, Henry; Maltz, David; et al (July 2020, Proceedings of the Annual conference of the ACM Special Interest Group on Data Communication on the applications, technologies, architectures, and protocols for computer communication)
Full Text Available
